Exploratory Data Analysis in R with ggplot2:

We will explore a dataset of wages and related attributes for a group of 3000 workers, and learn how to use ggplot2 to answer questions about it.

What is ggplot2? According to its site, ggplot2 is a plotting system for R, based on the grammar of graphics by Leland Wilkinson. It was written by Hadley Wickham while he was a grad student at Iowa State.

Loading the packages we need:

#uncomment the lines below if you don't have the packages.
#install.packages("ggplot2")
#install.packages("dplyr")
#install.packages("ISLR")

library(ggplot2) #data visualization
library(dplyr) #data manipulation
library(ISLR) #for the dataset
Wage = select(Wage, age, maritl, education, jobclass, wage) #select only some columns to make life easier
#View(Wage)

Let’s take a look at the dataset:

head(Wage, 3)
##        age           maritl       education       jobclass      wage
## 231655  18 1. Never Married    1. < HS Grad  1. Industrial  75.04315
## 86582   24 1. Never Married 4. College Grad 2. Information  70.47602
## 161300  45       2. Married 3. Some College  1. Industrial 130.98218
str(Wage)
## 'data.frame':    3000 obs. of  5 variables:
##  $ age      : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl   : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ education: Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ wage     : num  75 70.5 131 154.7 75 ...

We have 2 numeric variables: age and wage, and 3 categorical variables: marital status, education and jobclass. It will be important to distinguish between different types of variables since each type will require different visualization techniques.
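
One quick way to separate the two kinds of variables is to ask R which columns are numeric. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in with the same column types as Wage) so that it runs without ISLR; with the real data you would pass Wage instead.

```r
# Toy stand-in for Wage: two numeric columns and one factor (values are made up).
toy <- data.frame(
  age = c(18L, 24L, 45L, 43L),
  wage = c(75.0, 70.5, 131.0, 154.7),
  education = factor(c("1. < HS Grad", "4. College Grad",
                       "3. Some College", "4. College Grad"))
)

# Flag the numeric columns, then split the column names by type.
is_num <- sapply(toy, is.numeric)
numeric_vars <- names(toy)[is_num]
categorical_vars <- names(toy)[!is_num]

print(numeric_vars)
print(categorical_vars)
```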

Now we’d like to visualize the data and learn more about ggplot while at it.

Some questions:

How do we make a plot in ggplot2? Here is the recipe for any plot:

ggplot(data, aes(variables)) + geoms

A few observations:

Just running ggplot gives us a blank canvas for our great visualizations:

ggplot()

If we add the variables, ggplot draws the axes only since we haven’t told it what type of plot we want.

ggplot(data = Wage, mapping = aes(x = age, y = wage))

Finally we can use geoms to tell ggplot what type of plot we want.

ggplot(data = Wage, aes(x = age, y = wage)) + 
    geom_point()

We have our first ggplot plot, pretty cool! The cool thing is that every plot we will ever make will have roughly the same structure:

ggplot(data, aes(variables)) + geoms
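
Because a plot is an ordinary R object, the recipe can also be built up step by step and stored in variables. The following is a minimal sketch on made-up data (`toy` and the variable names are hypothetical, not part of the Wage dataset); it counts the layers to show that each geom adds exactly one.

```r
library(ggplot2)

# Made-up data standing in for Wage (hypothetical values).
set.seed(1)
toy <- data.frame(age = sample(18:80, 50, replace = TRUE))
toy$wage <- 50 + toy$age + rnorm(50, sd = 10)

# Step 1: data + aesthetic mappings, no geoms yet.
base_plot <- ggplot(toy, aes(x = age, y = wage))

# Step 2: each `+ geom_*()` appends one layer to the plot object.
scatter <- base_plot + geom_point()
scatter_smooth <- scatter + geom_smooth(method = "lm")

# Number of layers in each object.
n_layers <- c(length(base_plot$layers),
              length(scatter$layers),
              length(scatter_smooth$layers))
print(n_layers)
```

Typing the name of a stored plot (for example `scatter`) at the console draws it.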

What is the distribution of the wage? (numeric)

ggplot(Wage, aes(x = wage)) +
    geom_histogram(bins = 40, color = "black") 

ggplot is more verbose than base R for simple / canned graphics but less verbose for complex / custom graphics.

This syntax might take a bit to get used to, but once you have it set up, all you have to do is change the geom and/or the variables inside aes to create a wide range of plots.

Hint: if you don’t know which geom you want, type geom_ and press Tab to see all the available geoms.
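
If tab completion is not available, the same list can be produced in code. This is a small sketch that assumes ggplot2 is installed; it simply filters the functions the package exports by name.

```r
# List every exported ggplot2 function whose name starts with "geom_".
geoms <- sort(grep("^geom_", getNamespaceExports("ggplot2"), value = TRUE))

head(geoms)    # first few geoms in alphabetical order
length(geoms)  # how many there are in the installed version
```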

Stack layers on top of each other using different geoms:

ggplot2’s real power shows when you want to build more complex graphics with multiple layers. Doing this is easy: just add more geoms!

ggplot(data = Wage, mapping = aes(x = age, y = wage)) +
    geom_bin2d() +
    geom_point()

ggplot(data = Wage, mapping = aes(x = age, y = wage)) +
    geom_bin2d() +
    geom_point() +
    geom_density_2d()

Here we’re using 3 layers to get more info on the structure of the relationship. Is this a good visualization however? I’d say no, there is too much going on.

Inside the geoms you can fine tune your visualization:

Use ?geom_point if you don’t know what you can customize. Chances are if you want something changed ggplot can do it for you.

ggplot(Wage, aes(x = age, y = wage)) +
    geom_point(alpha = 0.5, color = "orange", size = 5, shape = 4) 

Dealing with Overplotting:

Let’s go back to the original question:

ggplot(data = Wage, mapping = aes(x = age, y = wage)) +
    geom_point()

There are too many overlapping points. Let’s reduce the opacity of the points using alpha and add a trend line using geom_smooth.

ggplot(data = Wage, mapping = aes(x = age, y = wage)) +
    geom_point(alpha = 0.5) +
    geom_smooth()

Other ways: jitter the points, or use geom_count or geom_hex (geom_hex requires the hexbin package):

ggplot(Wage, aes(x = age, y = wage)) + geom_count()  #+ geom_jitter()

ggplot(Wage, aes(x = age, y = wage)) + geom_hex()

A closer look at aes - or how to pick your variables:

Say you’d like to color your points based on a categorical variable: all you have to do is place color = that variable inside the aes function. If it is not inside aes, ggplot will not look in the data frame for that column.

ggplot(data = Wage, mapping = aes(x = age, y = wage, color = education)) +
    geom_point()
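
The difference between mapping a column inside aes and setting a constant outside it can be checked directly. The sketch below uses made-up data (`toy` is a hypothetical stand-in for Wage) and `ggplot_build()` to look at the colours each version would draw.

```r
library(ggplot2)

# Made-up data standing in for Wage (hypothetical values).
set.seed(2)
toy <- data.frame(
  age = sample(18:80, 60, replace = TRUE),
  wage = runif(60, 20, 300),
  education = factor(rep(c("HS Grad", "Some College", "College Grad"), each = 20))
)

# Mapping: inside aes(), `education` is looked up as a column of `toy`.
mapped <- ggplot(toy, aes(x = age, y = wage, color = education)) +
  geom_point()

# Setting: outside aes(), "blue" is a constant applied to every point.
fixed <- ggplot(toy, aes(x = age, y = wage)) +
  geom_point(color = "blue")

# ggplot_build() exposes the data each layer will draw, including colours.
mapped_colours <- unique(ggplot_build(mapped)$data[[1]]$colour)
fixed_colours <- unique(ggplot_build(fixed)$data[[1]]$colour)

print(length(mapped_colours))
print(fixed_colours)
```

A common slip is writing aes(color = "blue"): ggplot then treats "blue" as a one-level categorical value and draws the points in the default palette colour rather than in blue.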

Aes can be tricky, but you can always override the global aes by putting an aes (or a fixed value) inside a geom:

ggplot(data = Wage, mapping = aes(x = age, y = wage, color = education)) +
    geom_point(color = "black", aes(shape = maritl)) +
    #geom_smooth() +
    facet_grid(maritl ~ education)

Categorical Variables:

Getting back to another question: What is the relationship between education and wage?

ggplot(Wage, aes(x = education, y = wage)) +
    geom_boxplot()

ggplot(Wage, aes(x = education, y = wage, color = jobclass)) +
    geom_boxplot() 

Faceting - building multiple plots based on categorical variables in the data:

ggplot(Wage, aes(x = wage))  +
    geom_histogram()  +
    facet_wrap( ~ education)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
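
The message above appears because no bin specification was given. One way to respond to it is to set binwidth, which is expressed in the units of the x variable. This is a sketch on simulated data (`toy` and its values are made up, not the Wage data).

```r
library(ggplot2)

# Simulated right-skewed values standing in for wage (hypothetical).
set.seed(3)
toy <- data.frame(wage = rlnorm(500, meanlog = 4.6, sdlog = 0.4))

# binwidth is in the units of x, so each bar here spans 10 units of wage.
p <- ggplot(toy, aes(x = wage)) +
  geom_histogram(binwidth = 10, color = "black")

# Inspect the computed bins without drawing the plot.
bins <- ggplot_build(p)$data[[1]]
bar_widths <- bins$xmax - bins$xmin
print(range(bar_widths))
```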

ggplot(data = Wage, mapping = aes(x = age, y = wage)) +
    geom_point() +
    geom_smooth(aes(color = education)) +
    facet_wrap( ~ education) +
    ggtitle("Wage by Age and Education")

What is the relationship between age and marital status?

ggplot(Wage, aes(x = age, fill = maritl)) + 
    geom_histogram(bins = 20, color = "black", position = "fill") + 
    xlim(18, 60) + 
    scale_fill_brewer(palette = "Set1")
## Warning: Removed 173 rows containing non-finite values (stat_bin).
## Warning: Removed 5 rows containing missing values (geom_bar).
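
The first warning most likely comes from xlim: setting scale limits discards observations outside 18 to 60 before the histogram is computed. If the goal is only to zoom in, coord_cartesian keeps every row. The sketch below compares the two on made-up ages (`toy` is hypothetical) using a plain histogram rather than the filled one above.

```r
library(ggplot2)

# Made-up ages standing in for Wage$age (hypothetical values).
set.seed(4)
toy <- data.frame(age = sample(18:80, 300, replace = TRUE))

# xlim() sets scale limits: ages above 60 are dropped before binning.
p_limits <- ggplot(toy, aes(x = age)) +
  geom_histogram(bins = 20) +
  xlim(18, 60)

# coord_cartesian() only zooms the view: every row is still binned.
p_zoom <- ggplot(toy, aes(x = age)) +
  geom_histogram(bins = 20) +
  coord_cartesian(xlim = c(18, 60))

# Total number of observations counted by each histogram.
counted_limits <- sum(suppressWarnings(ggplot_build(p_limits))$data[[1]]$count)
counted_zoom <- sum(ggplot_build(p_zoom)$data[[1]]$count)
print(c(counted_limits, counted_zoom, sum(toy$age <= 60)))
```

With position = "fill" the choice matters less for the shape of the bars, since each bin is rescaled to proportions, but the dropped rows still never enter the calculation.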

Categorical vs. categorical? This one’s a bit tougher. Maybe maritl and education:

ggplot(Wage, aes(x = education, fill = maritl)) +
           geom_bar(position = "fill") 
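
The filled bar chart shows, for each education level, the share of each marital status. The same numbers can be computed as a table of row proportions in base R. This sketch uses a tiny made-up data frame (`toy`, hypothetical) so the arithmetic is easy to follow.

```r
# Made-up categories standing in for education and maritl.
toy <- data.frame(
  education = c("HS Grad", "HS Grad", "HS Grad", "College Grad", "College Grad"),
  maritl = c("Married", "Never Married", "Married", "Married", "Married")
)

# Counts of each (education, maritl) pair.
counts <- table(toy$education, toy$maritl)

# margin = 1 divides each row by its total, giving shares within each education level.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```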

Issues:

plotly::ggplotly(
ggplot(data = sample_n(Wage, 1000), mapping = aes(x = age, y = wage, color = education)) +
    geom_jitter())

Some dplyr stuff we probably won’t get to:

count(Wage, maritl) %>% 
    ggplot(aes(x = reorder(maritl, n), y = n)) +
    geom_col()

Wage %>% 
    group_by(age) %>% 
    summarise(mean_wage = mean(wage), count = n(), std = sd(wage)) %>% 
    mutate(se = std/sqrt(count)) %>% 
    ggplot(aes(x = age, y = mean_wage)) + geom_point() + geom_line() +
    geom_line(aes(y = mean_wage + 2*se), color = "grey") +
    geom_line(aes(y = mean_wage - 2*se), color = "grey")
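
As an alternative to drawing the upper and lower bounds as two grey lines, the band of plus or minus two standard errors can be drawn as a single shaded layer with geom_ribbon. This is a sketch on simulated data (`toy` and its values are made up), not a replacement for the pipeline above.

```r
library(ggplot2)
library(dplyr)

# Simulated data: 30 workers at each age from 20 to 29 (hypothetical values).
set.seed(5)
toy <- data.frame(age = rep(20:29, each = 30))
toy$wage <- 60 + 2 * toy$age + rnorm(nrow(toy), sd = 15)

# Mean wage and its standard error for each age.
by_age <- toy %>%
  group_by(age) %>%
  summarise(mean_wage = mean(wage), count = n(), std = sd(wage)) %>%
  mutate(se = std / sqrt(count))

# geom_ribbon() fills the area between ymin and ymax in a single layer.
p <- ggplot(by_age, aes(x = age, y = mean_wage)) +
  geom_ribbon(aes(ymin = mean_wage - 2 * se, ymax = mean_wage + 2 * se),
              fill = "grey", alpha = 0.5) +
  geom_line() +
  geom_point()
```

The ribbon is added first so that the line and points are drawn on top of it.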